Shared-Memory Parallelization of MTTKRP for Dense Tensors
Authors: Koby Hayashi, Grey Ballard, Yujie Jiang, Michael J. Tobia
Abstract
The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this paper, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We benchmark sequential and parallel performance of our implementations, demonstrating high sequential performance and efficient parallel scaling. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to 7.4× over existing parallel software.
ACM Reference format: Koby Hayashi, Grey Ballard, Yujie Jiang, Michael J. Tobia. 2018. Shared-Memory Parallelization of MTTKRP for Dense Tensors. 12 pages.
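To make the operation concrete, here is a minimal sketch of MTTKRP for a 3-way dense tensor, in pure Python so it runs without BLAS; it is not the authors' implementation. It assumes the tensor is stored as a flat list in column-major order, and all function names (`mttkrp_naive`, `khatri_rao`, `mttkrp_first`, `mttkrp_middle`, `mttkrp_last`) are hypothetical. The loops commented as GEMMs are where an optimized code would call BLAS instead. The sketch shows why no reordering of tensor entries is needed under this layout: the mode-1 unfolding is the memory layout itself, the mode-3 unfolding is its transpose, and the middle mode can be handled one contiguous frontal slice at a time.

```python
# Minimal sketch of dense 3-way MTTKRP in pure Python (no BLAS).
# Assumption: X is a flat list in column-major order, so X(i, j, k)
# sits at offset i + I*(j + J*k). Factor matrices are lists of rows.

def idx(i, j, k, I, J):
    """Flat column-major offset of X(i, j, k)."""
    return i + I * (j + J * k)


def mttkrp_naive(X, dims, factors, mode):
    """Reference MTTKRP from the definition; mode is 0, 1 or 2 (0-based)."""
    I, J, K = dims
    R = len(factors[0][0])
    M = [[0.0] * R for _ in range(dims[mode])]
    for k in range(K):
        for j in range(J):
            for i in range(I):
                x = X[idx(i, j, k, I, J)]
                ijk = (i, j, k)
                for r in range(R):
                    prod = x
                    for m in range(3):
                        if m != mode:  # multiply by every factor except the output mode's
                            prod *= factors[m][ijk[m]][r]
                    M[ijk[mode]][r] += prod
    return M


def khatri_rao(U, V):
    """Column-wise Kronecker product U ⊙ V: row v + len(V)*u is U[u] * V[v] elementwise."""
    n = len(V)
    R = len(V[0])
    return [[U[c // n][r] * V[c % n][r] for r in range(R)]
            for c in range(len(U) * n)]


def mttkrp_first(X, dims, A, B, C):
    """Mode 1: M = X_(1) (C ⊙ B). X_(1) is I x JK and is exactly the
    column-major layout of X, so this is one GEMM with no reordering.
    A is unused: the mode-1 output does not involve the mode-1 factor."""
    I, J, K = dims
    KR = khatri_rao(C, B)           # row j + J*k is C[k] * B[j]
    R = len(B[0])
    M = [[0.0] * R for _ in range(I)]
    for c in range(J * K):          # GEMM: column c of X_(1) is contiguous
        for i in range(I):
            x = X[i + I * c]
            for r in range(R):
                M[i][r] += x * KR[c][r]
    return M


def mttkrp_middle(X, dims, A, B, C):
    """Mode 2: each frontal slice X(:, :, k) is a contiguous I x J
    column-major matrix, so form T = X_k^T A (a small GEMM) and
    accumulate T with column r scaled by C[k][r]. B is unused."""
    I, J, K = dims
    R = len(A[0])
    M = [[0.0] * R for _ in range(J)]
    for k in range(K):
        base = I * J * k            # start of slice k
        for j in range(J):
            for r in range(R):
                t = sum(X[base + i + I * j] * A[i][r] for i in range(I))
                M[j][r] += t * C[k][r]
    return M


def mttkrp_last(X, dims, A, B, C):
    """Mode 3: M = X_(3) (B ⊙ A), and X_(3)^T is the IJ x K column-major
    view of X, so this is one transposed GEMM with no reordering. C is unused."""
    I, J, K = dims
    KR = khatri_rao(B, A)           # row i + I*j is B[j] * A[i]
    R = len(A[0])
    M = [[0.0] * R for _ in range(K)]
    for k in range(K):              # GEMM with the IJ x K view transposed
        for c in range(I * J):
            x = X[c + I * J * k]
            for r in range(R):
                M[k][r] += x * KR[c][r]
    return M
```

In a shared-memory setting, the slices in `mttkrp_middle` (or blocks of columns in the other two functions) could be split across threads, provided each thread accumulates into a private copy of `M` that is summed at the end; the sketch itself is sequential.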
Related papers
Experiments with Cholesky Factorization on Clusters of SMPs
Cholesky factorization of large dense matrices is an integral part of many applications in science and engineering. In this paper we report on experiments with different parallel versions of Cholesky factorization on modern high-performance computing architectures. For the parallelization of Cholesky factorization we utilized various standard linear algebra software packages and present perform...
Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination
We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared memory architectures. Contrary to the classical cubic algorithms in parallel numerical linear algebra, we focus here on recursive algorithms and coarse grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms making coarse...
Computer Science Technical Report Canonic Multi-Projection: Memory Allocation for Distributed Memory Parallelization
The Polyhedral model is now the accepted technology for automatic parallelization of affine control loop programs. It has been successful in automatically generating tiled shared memory parallel programs for shared memory platforms (plus vectorization). We address the challenges arising when we move toward distributed memory parallelization, based on wavefront execution of parameterized tiles. ...
Efficient Parallelization of Unstructured Reductions on Shared Memory Parallel Architectures
This paper presents a new parallelization method for an efficient implementation of unstructured array reductions on shared memory parallel machines with OpenMP. This method is strongly related to parallelization techniques for irregular reductions on distributed memory machines as employed in the context of High Performance Fortran. By exploiting data locality, synchronization is minimized with...